Telecom Client Case Study - Cluster Code

Unsupervised Learning

Introduction

In this project we will try to understand a data set of a company's customers using unsupervised learning tools. Because of the huge amount of data, we will first do some data cleaning and feature analysis, then PCA to summarize the data, and finally factor analysis and clustering, to characterize the clients and see whether we can group them by their characteristics.

Our main goal is to see whether the churn variable (a variable indicating whether a customer leaves the telecom company) is related to our analysis: can we group the customers without using the churn variable, and do the resulting groups correlate with it?

Get the data

First we load the data. The data set was downloaded from a Kaggle post about a telecom company. Its original purpose was to predict the "churn" variable, but we will instead try to segment the clients into groups and see whether those groups correspond to this variable.

customers.df = read.csv("dataset.csv", stringsAsFactors = TRUE)

glimpse(customers.df)
## Rows: 100,000
## Columns: 100
## $ rev_Mean         <dbl> 23.9975, 57.4925, 16.9900, 38.0000, 55.2300, 82.2750,…
## $ mou_Mean         <dbl> 219.25, 482.75, 10.25, 7.50, 570.50, 1312.25, 0.00, 6…
## $ totmrc_Mean      <dbl> 22.500, 37.425, 16.990, 38.000, 71.980, 75.000, 16.99…
## $ da_Mean          <dbl> 0.2475, 0.2475, 0.0000, 0.0000, 0.0000, 1.2375, 0.000…
## $ ovrmou_Mean      <dbl> 0.00, 22.75, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 419.…
## $ ovrrev_Mean      <dbl> 0.0000, 9.1000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
## $ vceovr_Mean      <dbl> 0.0000, 9.1000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
## $ datovr_Mean      <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
## $ roam_Mean        <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
## $ change_mou       <dbl> -157.25, 532.25, -4.25, -1.50, 38.50, 156.75, 0.00, 1…
## $ change_rev       <dbl> -18.9975, 50.9875, 0.0000, 0.0000, 0.0000, 8.1450, -0…
## $ drop_vce_Mean    <dbl> 0.6666667, 8.3333333, 0.3333333, 0.0000000, 9.6666667…
## $ drop_dat_Mean    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ blck_vce_Mean    <dbl> 0.6666667, 1.0000000, 0.0000000, 0.0000000, 0.6666667…
## $ blck_dat_Mean    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ unan_vce_Mean    <dbl> 6.3333333, 61.3333333, 2.6666667, 0.0000000, 77.00000…
## $ unan_dat_Mean    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ plcd_vce_Mean    <dbl> 52.333333, 263.333333, 9.000000, 3.666667, 222.333333…
## $ plcd_dat_Mean    <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
## $ recv_vce_Mean    <dbl> 42.3333333, 69.0000000, 0.3333333, 1.3333333, 94.6666…
## $ recv_sms_Mean    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ comp_vce_Mean    <dbl> 45.000000, 193.333333, 6.000000, 3.666667, 137.000000…
## $ comp_dat_Mean    <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
## $ custcare_Mean    <dbl> 0.0000000, 1.6666667, 0.0000000, 0.0000000, 8.6666667…
## $ ccrndmou_Mean    <dbl> 0.0000000, 6.3333333, 0.0000000, 0.0000000, 15.000000…
## $ cc_mou_Mean      <dbl> 0.0000000, 5.4633333, 0.0000000, 0.0000000, 11.076666…
## $ inonemin_Mean    <dbl> 18.0000000, 53.0000000, 0.3333333, 1.3333333, 66.0000…
## $ threeway_Mean    <dbl> 0.0000000, 0.3333333, 0.0000000, 0.0000000, 0.0000000…
## $ mou_cvce_Mean    <dbl> 90.643333, 189.396667, 5.426667, 8.410000, 285.233333…
## $ mou_cdat_Mean    <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.000…
## $ mou_rvce_Mean    <dbl> 97.1766667, 55.2800000, 0.0000000, 0.4133333, 106.330…
## $ owylis_vce_Mean  <dbl> 0.0000000, 46.3333333, 0.0000000, 0.3333333, 14.66666…
## $ mouowylisv_Mean  <dbl> 0.00000000, 24.21666667, 0.00000000, 0.25666667, 10.8…
## $ iwylis_vce_Mean  <dbl> 0.0000000, 6.3333333, 0.0000000, 0.0000000, 0.6666667…
## $ mouiwylisv_Mean  <dbl> 0.00000000, 3.69666667, 0.00000000, 0.00000000, 0.366…
## $ peak_vce_Mean    <dbl> 58.0000000, 83.6666667, 5.0000000, 1.3333333, 97.3333…
## $ peak_dat_Mean    <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000…
## $ mou_peav_Mean    <dbl> 132.600000, 75.333333, 5.193333, 3.380000, 173.476667…
## $ mou_pead_Mean    <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.000…
## $ opk_vce_Mean     <dbl> 24.0000000, 157.0000000, 1.0000000, 3.6666667, 90.333…
## $ opk_dat_Mean     <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.0…
## $ mou_opkv_Mean    <dbl> 55.2200000, 169.3433333, 0.2333333, 5.4500000, 218.08…
## $ mou_opkd_Mean    <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.0…
## $ drop_blk_Mean    <dbl> 1.3333333, 9.3333333, 0.3333333, 0.0000000, 10.333333…
## $ attempt_Mean     <dbl> 52.333333, 263.333333, 9.000000, 3.666667, 222.333333…
## $ complete_Mean    <dbl> 45.000000, 193.333333, 6.000000, 3.666667, 137.000000…
## $ callfwdv_Mean    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ callwait_Mean    <dbl> 0.3333333, 5.6666667, 0.0000000, 0.0000000, 0.0000000…
## $ churn            <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ months           <int> 61, 56, 58, 60, 57, 59, 53, 53, 55, 57, 59, 53, 55, 5…
## $ uniqsubs         <int> 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 1, 2, 5, 2, 1, 3,…
## $ actvsubs         <int> 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 1, 2, 1, 2, 1, 2,…
## $ new_cell         <fct> U, N, Y, Y, Y, Y, N, Y, Y, N, Y, Y, Y, Y, Y, N, N, N,…
## $ crclscod         <fct> A, EA, C, B, A, C, A, B, B, A, A, A, BA, A, C, B, B, …
## $ asl_flag         <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,…
## $ totcalls         <int> 1652, 14654, 7903, 1502, 4485, 26812, 6279, 4491, 167…
## $ totmou           <dbl> 4228.000, 26400.000, 24385.053, 3065.000, 14028.000, …
## $ totrev           <dbl> 1504.62, 2851.68, 2155.91, 2000.90, 2181.12, 4033.98,…
## $ adjrev           <dbl> 1453.44, 2833.88, 1934.47, 1941.81, 2166.48, 3932.90,…
## $ adjmou           <dbl> 4085.00, 26367.00, 24303.05, 3035.00, 13965.00, 40295…
## $ adjqty           <int> 1602, 14624, 7888, 1479, 4452, 26362, 6271, 4470, 165…
## $ avgrev           <dbl> 29.66, 51.53, 34.54, 40.45, 38.69, 83.68, 58.95, 34.7…
## $ avgmou           <dbl> 83.37, 479.40, 433.98, 63.23, 249.38, 857.34, 334.06,…
## $ avgqty           <dbl> 32.69, 265.89, 140.86, 30.81, 79.50, 560.89, 120.60, …
## $ avg3mou          <int> 272, 305, 12, 8, 558, 1260, 0, 633, 973, 6, 90, 18, 1…
## $ avg3qty          <int> 116, 158, 7, 3, 191, 960, 0, 96, 465, 3, 16, 13, 625,…
## $ avg3rev          <int> 30, 40, 17, 38, 55, 80, 17, 39, 90, 30, 60, 36, 80, 2…
## $ avg6mou          <int> 322, 477, 11, 50, 586, 1187, 0, 719, 915, 54, 123, 36…
## $ avg6qty          <int> 136, 275, 6, 25, 196, 853, 0, 112, 434, 7, 32, 26, 62…
## $ avg6rev          <int> 38, 48, 17, 40, 80, 78, 17, 58, 90, 34, 64, 36, 76, 2…
## $ prizm_social_one <fct> S, U, S, T, U, U, C, , S, C, C, C, U, C, S, S, T, U, …
## $ area             <fct> NORTHWEST/ROCKY MOUNTAIN AREA, CHICAGO AREA, GREAT LA…
## $ dualband         <fct> Y, N, N, N, Y, N, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, N, N,…
## $ refurb_new       <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,…
## $ hnd_price        <dbl> 149.98999, NA, 29.98999, 29.98999, 149.98999, 129.989…
## $ phones           <int> 2, 7, 2, 1, 6, 9, 4, 3, 3, 2, 3, 4, 9, 2, 10, 5, 3, 6…
## $ models           <int> 2, 6, 1, 1, 4, 4, 3, 2, 3, 2, 3, 3, 5, 2, 6, 4, 2, 5,…
## $ hnd_webcap       <fct> WCMB, WC, NA, NA, WCMB, WCMB, NA, WCMB, NA, WCMB, WCM…
## $ truck            <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ rv               <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ownrent          <fct> O, , O, , R, O, O, R, , O, O, O, O, , O, O, O, O, O, …
## $ lor              <int> 15, 1, 7, 6, 5, 1, 8, 0, 3, 8, 15, 4, 15, 1, 11, 4, 3…
## $ dwlltype         <fct> S, S, S, M, M, , S, S, M, S, S, S, S, S, S, M, S, S, …
## $ marital          <fct> S, S, M, M, S, S, M, M, M, M, S, A, S, U, S, U, U, S,…
## $ adults           <int> 1, 1, 2, 4, 1, 1, 3, 2, 3, 2, 5, 3, 3, 1, 3, 3, 2, 5,…
## $ infobase         <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M,…
## $ income           <int> 4, 5, 5, 6, 6, 6, 9, 1, 4, 9, 6, 9, 5, 7, 3, 1, 3, 4,…
## $ numbcars         <int> 3, 1, 2, 1, 1, 1, NA, 2, 1, 2, NA, NA, NA, NA, 2, NA,…
## $ HHstatin         <fct> C, C, C, C, C, C, I, C, C, I, C, , B, A, I, , C, C, C…
## $ dwllsize         <fct> A, A, A, D, O, , A, A, E, A, A, A, A, A, A, , A, A, A…
## $ forgntvl         <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ethnic           <fct> N, Z, N, U, I, U, N, S, F, N, N, N, J, N, N, N, N, N,…
## $ kid0_2           <fct> U, U, U, Y, U, U, U, U, U, U, Y, U, U, U, U, U, U, U,…
## $ kid3_5           <fct> U, U, Y, U, U, U, U, U, U, U, U, U, U, U, U, Y, U, U,…
## $ kid6_10          <fct> U, U, U, U, U, U, U, U, U, U, U, Y, U, U, Y, U, U, U,…
## $ kid11_15         <fct> U, U, U, U, U, U, U, U, U, U, U, Y, U, U, U, U, U, U,…
## $ kid16_17         <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, Y,…
## $ creditcd         <fct> Y, Y, Y, Y, Y, Y, Y, Y, N, Y, Y, Y, Y, N, Y, Y, Y, Y,…
## $ eqpdays          <int> 361, 240, 1504, 1812, 434, 458, 852, 231, 700, 601, 4…
## $ Customer_ID      <int> 1000001, 1000002, 1000003, 1000004, 1000005, 1000006,…

At first sight, we can see a lot of factor variables and a large number of NAs.

Data Cleaning

First, we will take care of the missing values.

hist(rowMeans(is.na(customers.df)), xlab = "Missing values average by rows", main = "")

We can see that some rows have a lot of missing values; that is, for some customers we do not have much information.

Then we will look at the missing values by columns.

indexesEmptyCols = which(colMeans(is.na(customers.df)) != 0)
colsWithNA = sort(colMeans(is.na(customers.df[, indexesEmptyCols])), decreasing = TRUE)
barplot(colsWithNA, las=2)

Here we can clearly see that, of the 100 columns we have, five account for most of the missing values. So we will first remove those columns.

indexesColsToRemove = which(names(customers.df) %in% names(colsWithNA)[1:5])

customers.df = customers.df[, -indexesColsToRemove]

Now that the columns are mostly clean, we will look at all the rows that have missing values.

indexesEmptyRows = which(rowMeans(is.na(customers.df))!= 0)
print(length(indexesEmptyRows))
## [1] 6069

There are about 6,000 rows with some missing values. Since this is small compared to the total number of rows, we will remove them.

customers.df = customers.df[-indexesEmptyRows,]
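The drop-rows-with-any-NA pattern used above can be sketched on a toy data frame (the columns here are purely illustrative, not part of the real data set):

```r
# Toy version of the row-filtering step applied to customers.df above.
df = data.frame(a = c(1, NA, 3, 4), b = c(1, 2, NA, 4))
idx = which(rowMeans(is.na(df)) != 0)  # rows containing at least one NA
df = df[-idx, ]
nrow(df)  # the two complete rows remain
```

Any row whose share of NAs is nonzero is dropped, so the result is fully complete.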

Now, with columns and rows almost cleaned, we check whether any individual values are still marked as missing, and then clean up the working environment.

print(length(which(is.na(customers.df) == TRUE)))
## [1] 0
rm(list = setdiff(ls(), "customers.df"))

Now we are going to take care of the factor variables, because a lot of missing values are often hidden in them as empty-string levels. So let's collect every factor column that has an empty level.

customers.df = droplevels.data.frame(customers.df)

colsWithEmptyFactors.df = data.frame(colName = character(), index = integer(), naNumber = integer(),
                                     stringsAsFactors = FALSE)

for (i in 1:ncol(customers.df)) {
    levels = levels(customers.df[, i])
    if ("" %in% levels) {
        colsWithEmptyFactors.df = rbind(colsWithEmptyFactors.df, data.frame(colName = names(customers.df)[i], index = i, naNumber = 0,
                                                                            stringsAsFactors = FALSE))
    }
}

For each of these factor columns, we recode the empty level as NA and count the resulting missing values.

for (i in 1:nrow(colsWithEmptyFactors.df)) {
    index = colsWithEmptyFactors.df$index[i]
    
    level = levels(customers.df[, index])
    level[which(level == "")] = NA
    levels(customers.df[, index]) = level
    
    colsWithEmptyFactors.df[i, 3] = length(which(is.na(customers.df[, index]) == TRUE))
}
print(colsWithEmptyFactors.df)
##            colName index naNumber
## 1 prizm_social_one    71     6334
## 2             area    72       38
## 3          ownrent    80    30225
## 4         dwlltype    81    28518
## 5         infobase    83    19206
## 6         HHstatin    84    34122
## 7         dwllsize    85    34639
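The recoding above can be illustrated on a toy factor (a hypothetical miniature, not one of the real columns): an empty string is a perfectly valid factor level, so is.na() misses those entries until the level itself is renamed to NA.

```r
# Missing values hidden as an empty-string factor level.
f = factor(c("A", "", "B", ""))
sum(is.na(f))        # 0: the empty strings are not NAs yet

lv = levels(f)
lv[lv == ""] = NA    # rename the empty level to NA
levels(f) = lv
sum(is.na(f))        # 2: the hidden missing values are now visible
```

Assigning NA to a level also removes it, so the cleaned factor no longer carries the empty level.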

Because all of these factors except area and prizm_social_one have too many NAs, we will remove those columns.

customers.df = customers.df[, -colsWithEmptyFactors.df$index[3:nrow(colsWithEmptyFactors.df)]]

Now, we will get all the rows with missing values.

indexesEmptyRows = which(rowMeans(is.na(customers.df))!= 0)
length(indexesEmptyRows)
## [1] 6370

Because the number is relatively small, we will delete those rows.

customers.df = customers.df[-indexesEmptyRows,]

print(length(which(is.na(customers.df) == TRUE)))
## [1] 0

Now there are no more NAs, and we are still left with a pretty big data set to work on. Finally, we look for duplicated clients.

length(which(duplicated(customers.df) == TRUE))
## [1] 0

And there are none.

Feature Engineering

After looking at the data, we have pretty much all the variables that we need to work with, but there are some steps to take care of first:

  • Transform some numeric variables into factors, because they only take the values 0 or 1.
  • Create a separate data set without the churn variable (the variable to predict); this is the one we will work on.
  • Name the rows of the data set with the customer IDs so we can identify each customer.
  • Remove redundant variables using a correlation matrix, and drop some others that are not important.

REMOVE = "remove"
TO_FACTOR = "toFactor"

editColumn = function(colNames, df, toDo) {
    for (col in colNames) {
        index = which(names(df) == col)
        
        if (!is_empty(index)) {
            if (toDo == REMOVE) {
                df = df[, -index]
            } else if (toDo == TO_FACTOR) {
                df[, index] = as.factor(df[, index])
            } else { stop("Could not perform that action.") }  # signal the error, not just create it
        }
    }
    
    return(df)
}

customers.df = editColumn(c("forgntvl", "truck", "rv"), customers.df, TO_FACTOR)

customers.df = droplevels.data.frame(customers.df)

row.names(customers.df) = customers.df$Customer_ID

customers_original.df = customers.df

customers.df = editColumn(c("churn", "Customer_ID"), customers.df, REMOVE)

After removing the target variable and converting the binary columns to factors, we clean the environment.

rm(list = setdiff(ls(), c("customers.df", "customers_original.df")))

Understand the data

Let’s plot all the data set to look for outliers.

boxplot(customers.df, las=2, col="darkblue")

We can see that the variables are on very different scales, and that there are quite a few outliers. So let's see if scaling the data helps.

customers_numeric.df = select_if(customers.df, is.numeric)
customers_not_numeric.df = select_if(customers.df, negate(is.numeric))

boxplot(scale(customers_numeric.df), las=2, col="darkblue")

We can see a little bit of improvement, with fewer extreme outliers. However, scaling sets every variable's variance to one, so we lose that information; let's look instead at the correlation between the variables.

R = cor(customers_numeric.df)

ggcorr(customers_numeric.df, cor_matrix = R, label = TRUE)

Because of the immense number of variables, the plot is only readable when zoomed in. Still, we can see that many variables are strongly correlated, so we will remove the redundant ones (only one of each strongly correlated pair). This makes sense, because many variables are averages of the same quantity over 3, 6, etc. months.

R[upper.tri(R)] = 0
diag(R) = 0
 
customers_numeric_no_corr.df =  customers_numeric.df[, !apply(R, 2, function(x) any(abs(x) > 0.9, na.rm = TRUE))]

rm(R)

With this we can see that 22 variables were removed.
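The pruning rule can be sketched on toy data (hypothetical columns, same 0.9 threshold as above): of two nearly identical columns, exactly one survives.

```r
# Toy version of the correlation filter: drop one of each pair with |r| > 0.9.
set.seed(1)
x = rnorm(100)
toy = data.frame(x = x, y = x + rnorm(100, sd = 0.01), z = rnorm(100))
R = cor(toy)
R[upper.tri(R)] = 0
diag(R) = 0
keep = !apply(R, 2, function(col) any(abs(col) > 0.9, na.rm = TRUE))
names(toy)[keep]  # "y" and "z": only one of the correlated pair remains
```

Zeroing the upper triangle and diagonal means each column is only checked against the columns that come after it, so one member of every correlated pair is kept.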

PCA

First of all, let's compute the PCA both unscaled and scaled, to see whether the results differ.

pca = prcomp(customers_numeric_no_corr.df)
summary(pca)
## Importance of components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     3871.0475 536.18650 531.94380 274.37625 231.89810
## Proportion of Variance    0.9509   0.01824   0.01796   0.00478   0.00341
## Cumulative Proportion     0.9509   0.96915   0.98711   0.99188   0.99530
##                              PC6       PC7       PC8      PC9     PC10     PC11
## Standard deviation     154.90526 133.51307 110.74285 88.09443 53.09050 44.61102
## Proportion of Variance   0.00152   0.00113   0.00078  0.00049  0.00018  0.00013
## Cumulative Proportion    0.99682   0.99795   0.99873  0.99922  0.99940  0.99953
##                            PC12     PC13     PC14     PC15     PC16     PC17
## Standard deviation     37.65613 33.26910 30.81608 27.63984 23.80900 22.18829
## Proportion of Variance  0.00009  0.00007  0.00006  0.00005  0.00004  0.00003
## Cumulative Proportion   0.99962  0.99969  0.99975  0.99979  0.99983  0.99986
##                            PC18     PC19     PC20     PC21     PC22     PC23
## Standard deviation     21.09227 19.10587 16.31055 14.50608 14.08652 12.59224
## Proportion of Variance  0.00003  0.00002  0.00002  0.00001  0.00001  0.00001
## Cumulative Proportion   0.99989  0.99991  0.99993  0.99994  0.99996  0.99997
##                            PC24    PC25  PC26  PC27 PC28  PC29  PC30  PC31
## Standard deviation     12.03064 9.67548 8.714 7.441 6.95 6.135 4.932 3.987
## Proportion of Variance  0.00001 0.00001 0.000 0.000 0.00 0.000 0.000 0.000
## Cumulative Proportion   0.99997 0.99998 1.000 1.000 1.00 1.000 1.000 1.000
##                         PC32  PC33 PC34  PC35  PC36  PC37  PC38  PC39   PC40
## Standard deviation     3.459 3.254  2.6 1.843 1.686 1.393 1.196 1.004 0.9752
## Proportion of Variance 0.000 0.000  0.0 0.000 0.000 0.000 0.000 0.000 0.0000
## Cumulative Proportion  1.000 1.000  1.0 1.000 1.000 1.000 1.000 1.000 1.0000
##                          PC41   PC42   PC43   PC44  PC45   PC46   PC47
## Standard deviation     0.9578 0.7785 0.7314 0.5075 0.471 0.3659 0.3372
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.000 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000 1.0000 1.0000 1.000 1.0000 1.0000
##                             PC48
## Standard deviation     1.347e-09
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

In the unscaled PCA, the first component alone explains about 95% of the variance.

fviz_screeplot(pca, addlabels = TRUE)

fviz_contrib(pca, choice = "var", axes = 1)

Without scaling, we clearly see that the only important variable is adjqty (adjusted total number of calls). That is not ideal for a model: with 48 variables, a single one should not be the only important one.
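Why a single variable dominates the unscaled PCA is easy to reproduce on toy data (hypothetical columns big and small): without scaling, PCA maximizes raw variance, so the first component simply aligns with the largest-scale column.

```r
# Unscaled PCA on two columns with very different scales.
set.seed(42)
toy = data.frame(big = rnorm(500, sd = 1000), small = rnorm(500, sd = 1))
p = prcomp(toy)                  # no scaling
summary(p)$importance["Proportion of Variance", "PC1"]  # essentially 1
abs(p$rotation["big", "PC1"])    # essentially 1: PC1 is just 'big'
```

The loading of big on PC1 is near one, exactly the adjqty situation above.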

Now, let's compute the PCA on the scaled data.

pca_scaled = prcomp(customers_numeric_no_corr.df, scale. = TRUE)
summary(pca_scaled)
## Importance of components:
##                           PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     3.7682 1.87026 1.68027 1.41888 1.3926 1.33714 1.31562
## Proportion of Variance 0.2958 0.07287 0.05882 0.04194 0.0404 0.03725 0.03606
## Cumulative Proportion  0.2958 0.36868 0.42750 0.46945 0.5099 0.54710 0.58316
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.25567 1.17608 1.04803 1.00326 1.00004 0.98491 0.97153
## Proportion of Variance 0.03285 0.02882 0.02288 0.02097 0.02084 0.02021 0.01966
## Cumulative Proportion  0.61600 0.64482 0.66770 0.68867 0.70951 0.72972 0.74938
##                           PC15    PC16    PC17    PC18    PC19   PC20    PC21
## Standard deviation     0.93902 0.91884 0.90249 0.89385 0.88124 0.8542 0.82717
## Proportion of Variance 0.01837 0.01759 0.01697 0.01665 0.01618 0.0152 0.01425
## Cumulative Proportion  0.76775 0.78534 0.80231 0.81895 0.83513 0.8503 0.86459
##                          PC22    PC23    PC24   PC25    PC26    PC27    PC28
## Standard deviation     0.7959 0.78104 0.74472 0.6999 0.67444 0.64110 0.62981
## Proportion of Variance 0.0132 0.01271 0.01155 0.0102 0.00948 0.00856 0.00826
## Cumulative Proportion  0.8778 0.89049 0.90205 0.9123 0.92173 0.93029 0.93855
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.60046 0.55939 0.53118 0.50361 0.48078 0.47309 0.45717
## Proportion of Variance 0.00751 0.00652 0.00588 0.00528 0.00482 0.00466 0.00435
## Cumulative Proportion  0.94607 0.95259 0.95846 0.96375 0.96856 0.97323 0.97758
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.43751 0.41427 0.39620 0.34701 0.33725 0.32269 0.25189
## Proportion of Variance 0.00399 0.00358 0.00327 0.00251 0.00237 0.00217 0.00132
## Cumulative Proportion  0.98157 0.98514 0.98841 0.99092 0.99329 0.99546 0.99678
##                           PC43    PC44    PC45    PC46     PC47      PC48
## Standard deviation     0.24544 0.22522 0.16081 0.13258 0.003562 1.474e-10
## Proportion of Variance 0.00126 0.00106 0.00054 0.00037 0.000000 0.000e+00
## Cumulative Proportion  0.99804 0.99909 0.99963 1.00000 1.000000 1.000e+00
fviz_screeplot(pca_scaled, addlabels = TRUE)

fviz_contrib(pca_scaled, choice = "var", axes = 1)

We see that the scaled PCA is much more descriptive than the unscaled one. In the unscaled PCA, adjqty (adjusted total number of calls) is the only important variable and there is effectively a single principal component, whereas the scaled version spreads the importance over more components and more variables. It is true that scaling removes the differences in variance between the variables, and that each component then explains less of the variability, but the result is far more descriptive.

Now, let’s look at the other principal components.

fviz_contrib(pca_scaled, choice = "var", axes = 2)

Here, fewer variables contribute to the component, but we can still roughly identify which ones are important.

fviz_contrib(pca_scaled, choice = "var", axes = 3)

With the data scaled, the variance explained is spread much more evenly across the principal components. The first three explain only about 50% of the variance, but many more variables matter, and we will use these components to cluster the data into groups; otherwise we would effectively be left with a single variable, which is not accurate.
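One common way to decide how many of these components to keep is the cumulative explained variance. The snippet below sketches the computation on toy data so it runs on its own (with pca_scaled the last two lines apply directly; the 90% threshold is an illustrative choice):

```r
# How many scaled PCs are needed to reach 90% cumulative variance.
set.seed(9)
toy = matrix(rnorm(300 * 6), ncol = 6)
toy[, 2] = toy[, 1] + rnorm(300, sd = 0.3)   # one correlated pair
p = prcomp(toy, scale. = TRUE)
cumvar = cumsum(p$sdev^2) / sum(p$sdev^2)
k = which(cumvar >= 0.90)[1]   # index of the first PC reaching 90%
```

The squared standard deviations are the eigenvalues, so their normalized cumulative sum is exactly the "Cumulative Proportion" row in the summary above.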

rm(pca)

We will check whether any customers stand out from the others.

head(get_pca_ind(pca_scaled)$contrib[,1], n = 20)
##      1000001      1000003      1000004      1000005      1000006      1000007 
## 3.025519e-04 8.996525e-04 9.920473e-04 1.988757e-04 1.736548e-02 5.163934e-04 
##      1000009      1000010      1000011      1000012      1000013      1000014 
## 2.607328e-03 7.915901e-04 5.365838e-06 3.980850e-04 3.083818e-04 8.475781e-04 
##      1000015      1000016      1000017      1000018      1000019      1000020 
## 7.630032e-03 6.659053e-05 1.949787e-05 8.853703e-05 7.616919e-04 9.712061e-05 
##      1000021      1000022 
## 1.071428e-02 1.152903e-02

And there are not.

Now, let's look at each specific customer to see the percentage of variance they contribute.

head(sort(decreasing = TRUE, pca_scaled$x[,1]^2)/(pca_scaled$sdev[1]^2))/dim(customers_numeric.df)[1]
##     1011660     1049099     1001811     1000652     1071600     1000318 
## 0.003392062 0.002252876 0.002085621 0.001727743 0.001629091 0.001613368
fviz_contrib(pca_scaled, choice = "ind", axes = 1)

If we look past the customer IDs, we can see that some customers explain much more of the variance than the rest. However, since we assume all customers are comparable, we cannot know exactly what this means; perhaps, compared against the churn variable, these would turn out to be the ones quitting.

What about the top customers?

fviz_contrib(pca_scaled, choice = "ind", axes = 1, top=200)

Here we can see that the customers' contributions decay roughly exponentially, but without a steep slope: some clients stand out, but not many, and the differences between customers are not huge.
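The contribution plotted above is essentially each individual's share of the component's total squared score (the manual computation a few chunks up matches this up to an n/(n-1) factor from prcomp's variance divisor). A quick self-contained check on toy data confirms these shares partition the component:

```r
# Individual contributions to PC1 as percentage shares of its squared scores.
set.seed(3)
toy = matrix(rnorm(50 * 4), ncol = 4)
p = prcomp(toy, scale. = TRUE)
contrib1 = 100 * p$x[, 1]^2 / sum(p$x[, 1]^2)
sum(contrib1)   # 100: the shares sum to the whole component
```

So a "highlighted" customer is simply one with an unusually large squared PC1 score.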

customers_numeric_no_corr.df[order(get_pca_ind(pca_scaled)$contrib[,1], decreasing = TRUE)[1:20], ]
##         totmrc_Mean  da_Mean vceovr_Mean datovr_Mean roam_Mean change_mou
## 1011660    209.9900 159.3900    155.1750      0.8250    1.9000   -1070.25
## 1049099    199.9900   8.9100    494.0000      0.3900    0.4875   -2254.25
## 1001811     42.9425   2.9700    744.0625      0.0000    0.0000   -2785.00
## 1000652     75.0000   0.9900    432.0750      0.0000    0.0000     135.75
## 1071600    142.4900   5.6925    285.2750      0.9250   36.5675    -403.75
## 1000318    202.0050  55.1925    215.1375      0.0975    0.0975     105.75
## 1085979    209.9900   4.4550      0.0000      0.0000   20.0925    -416.50
## 1063639    341.9250  72.7650     68.2500      0.0000    0.0000     529.25
## 1000572     80.0000  26.7300    188.7250      0.0000    0.0000    -932.25
## 1070494    138.9900   9.4050    183.7750      1.2375    3.8650    1054.50
## 1098072    148.4500   5.1975    285.8000      0.7875    1.3650     345.00
## 1038142     84.9800  57.6675    104.1875      4.5825    9.4975     189.00
## 1039871     75.0000   4.4550    268.8125      1.7550    0.6475     948.75
## 1001077    210.3925   7.6725     83.4375      0.7500   12.1350    -355.00
## 1013042    136.2000  30.9375    890.7625      0.0000   37.0850   -2530.75
## 1028321     56.2500   2.9700     50.1250      0.2925    3.0200   -1291.50
## 1069899    210.3925  17.5725     54.8875      0.0000   16.5400    -322.50
## 1079185     82.4875   0.4950    424.9000      0.3900    0.2925    1295.75
## 1053185     84.9900   0.2475    273.8125      0.0000   11.2800     199.50
## 1057745    195.3925   0.0000    127.3875      0.0000    7.4575     940.25
##         change_rev drop_vce_Mean drop_dat_Mean blck_vce_Mean blck_dat_Mean
## 1011660  -154.8250    117.333333      0.000000      9.000000     0.0000000
## 1049099  -478.4850     60.000000      0.000000     58.666667     0.0000000
## 1001811  -524.0600     27.333333      0.000000     40.333333     0.0000000
## 1000652    97.2600     78.666667      0.000000      9.666667     0.0000000
## 1071600  -129.8225    106.666667      0.000000     13.000000     0.0000000
## 1000318   -63.5975      7.666667      0.000000     64.000000     0.0000000
## 1085979   -23.6875     19.666667      0.000000    199.000000     0.3333333
## 1063639   -53.0750     35.000000      0.000000    144.000000     0.0000000
## 1000572   -74.9875      9.333333      0.000000     35.000000     0.0000000
## 1070494    12.3675     88.000000      0.000000      1.666667     0.0000000
## 1098072  -228.8100     73.333333      0.000000      4.666667     0.0000000
## 1038142     2.2325     69.000000      1.333333     13.666667     0.0000000
## 1039871   146.9200     94.000000      0.000000      8.333333     0.0000000
## 1001077   -25.7825     70.333333      1.666667     10.666667     2.0000000
## 1013042  -768.5750     54.666667      0.000000    239.000000     0.0000000
## 1028321  -120.5400     57.666667      0.000000     10.333333     0.0000000
## 1069899   -85.1425    130.666667      0.000000     24.000000     0.0000000
## 1079185   389.8225    130.666667      0.000000     18.000000     0.0000000
## 1053185    70.1500     72.000000      0.000000      9.666667     0.0000000
## 1057745    39.0525     75.000000      0.000000     44.000000     0.0000000
##         unan_vce_Mean unan_dat_Mean recv_sms_Mean custcare_Mean cc_mou_Mean
## 1011660      326.0000     0.0000000             0     7.6666667   43.096667
## 1049099      318.6667     0.0000000             0     2.3333333    4.523333
## 1001811      317.0000     0.0000000             0     0.3333333    0.540000
## 1000652      321.0000     0.0000000             0     0.0000000    0.000000
## 1071600      316.3333     0.0000000             0    20.6666667   18.576667
## 1000318      157.0000     0.0000000             0     5.3333333   21.690000
## 1085979      373.6667     0.3333333             0     9.0000000   39.380000
## 1063639      126.6667     0.0000000             0    28.6666667   25.840000
## 1000572      230.3333     0.0000000             0     0.0000000    0.000000
## 1070494      222.3333     0.0000000             0    20.3333333   67.120000
## 1098072      292.0000     0.0000000             0     0.0000000    0.000000
## 1038142      679.6667     0.0000000             0    20.6666667   52.000000
## 1039871      308.6667     0.0000000             0    31.3333333   42.416667
## 1001077      227.6667     0.0000000             0     2.0000000    5.433333
## 1013042      164.0000     0.0000000             0    19.0000000  168.886667
## 1028321      170.6667     0.0000000             0     2.0000000    2.290000
## 1069899      419.0000     0.0000000             0     3.0000000    2.423333
## 1079185      814.3333     0.0000000             0    93.0000000  126.473333
## 1053185      447.3333     0.0000000             0    17.6666667   26.350000
## 1057745      232.6667     0.3333333             0    35.3333333  149.860000
##         inonemin_Mean threeway_Mean mou_cvce_Mean mou_rvce_Mean owylis_vce_Mean
## 1011660      693.3333     4.3333333     2753.6333     1957.5600       179.00000
## 1049099      755.3333     0.3333333     1079.8633     1926.7400       379.00000
## 1001811      833.0000     0.0000000     1100.7733     2140.3967       437.00000
## 1000652      283.0000     1.6666667     3701.3633      807.7567       285.00000
## 1071600      489.0000     0.6666667     2361.7533     1327.5300       517.00000
## 1000318      216.3333     0.6666667     2248.9300     1666.5833       343.00000
## 1085979     1075.3333     5.3333333     1197.8633     1492.1900       167.66667
## 1063639      166.0000    10.3333333     2335.1700     1443.1933       393.66667
## 1000572      533.3333     0.3333333     1725.8633     1382.2467       197.00000
## 1070494      779.3333    12.0000000     1433.9333     1956.2267       195.66667
## 1098072     1145.3333     4.3333333      741.6933     1139.4733       315.66667
## 1038142     1093.6667    26.0000000     1458.8933     1502.3000       287.00000
## 1039871     1323.0000     4.0000000     1127.6367     1518.8200        98.33333
## 1001077      741.0000     1.0000000     1987.4967     1820.2667       389.66667
## 1013042      184.0000     6.6666667     2385.5433     1015.5267       118.66667
## 1028321     1852.3333     0.3333333      917.8867     1193.6467       540.66667
## 1069899      452.0000    16.6666667     1676.6933      927.7367       292.66667
## 1079185      297.0000     5.3333333     1763.1867      575.0700       644.33333
## 1053185      448.3333    12.3333333     1910.7700      541.0900       610.33333
## 1057745      336.3333     6.6666667     2846.6700     1617.8833       211.33333
##         mouowylisv_Mean iwylis_vce_Mean mouiwylisv_Mean peak_dat_Mean
## 1011660        346.0033        41.00000       104.98000     4.6666667
## 1049099        277.5133       376.66667       476.07667     0.0000000
## 1001811        306.2033       519.33333       743.19000     0.0000000
## 1000652        328.1833        36.00000        65.67000     0.0000000
## 1071600        725.2700       282.00000       659.60667     3.3333333
## 1000318        549.3900       141.66667       434.37000     0.3333333
## 1085979        144.6100       174.66667       151.76667     0.3333333
## 1063639        554.0000        48.33333        99.85000     0.0000000
## 1000572        149.7767       281.66667       387.97667     0.0000000
## 1070494        326.9867       176.00000       250.73333     4.0000000
## 1098072        253.9333       280.00000       292.90667     0.3333333
## 1038142        251.8200       108.66667       125.70667     1.3333333
## 1039871        105.4133        72.00000        56.35000     1.3333333
## 1001077        585.3800       173.00000       342.61000    13.6666667
## 1013042        158.2133        16.66667        38.03667     0.0000000
## 1028321        384.9600       344.66667       318.16667     0.3333333
## 1069899        319.2967        82.00000       133.66667     0.0000000
## 1079185       1187.5633       131.33333       332.71667     0.0000000
## 1053185        674.0000       140.33333       152.48000     0.0000000
## 1057745        351.4033        77.66667       212.97667     0.3333333
##         mou_peav_Mean mou_pead_Mean opk_vce_Mean opk_dat_Mean mou_opkv_Mean
## 1011660     2943.1367    9.23000000     803.6667    1.3333333     1768.0600
## 1049099     1585.2967    0.00000000    1121.0000    0.0000000     1421.3033
## 1001811     1395.5967    0.00000000    1429.3333    0.0000000     1845.5767
## 1000652     4015.3467    0.00000000     243.0000    0.0000000      493.7767
## 1071600     1503.5500    6.23666667    1167.3333    5.0000000     2185.7300
## 1000318     3089.5633    0.04333333     417.6667    0.0000000      825.9367
## 1085979     1210.5467    0.57666667    1643.3333    0.6666667     1479.5100
## 1063639     3179.2900    0.00000000     296.0000    0.0000000      599.0800
## 1000572     2761.3167    0.00000000     235.0000    0.0000000      346.7967
## 1070494     1303.8333   10.10000000    1318.0000    1.3333333     2046.3400
## 1098072     1037.9133    1.34333333     980.3333    1.0000000      843.2533
## 1038142     1241.7633    7.18666667    1438.0000    3.6666667     1719.4200
## 1039871     1217.7200    2.16666667    1438.3333    1.6666667     1428.7333
## 1001077     2548.2467   11.55666667     538.6667   24.0000000     1259.5133
## 1013042     2132.0633    0.00000000     425.3333    0.0000000     1269.0000
## 1028321     1022.7233    0.02666667    1399.3333    0.3333333     1088.8133
## 1069899     1434.5233    0.00000000     882.6667    0.0000000     1169.9000
## 1079185      879.0967    0.00000000     710.0000    1.0000000     1459.1633
## 1053185     1346.4233    0.00000000    1112.3333    0.0000000     1105.4333
## 1057745     1777.2500    0.02333333     738.3333    0.0000000     2687.3133
##         mou_opkd_Mean drop_blk_Mean complete_Mean callfwdv_Mean callwait_Mean
## 1011660     0.9733333     126.33333     1381.6667             0   182.0000000
## 1049099     0.0000000     118.66667     1119.0000             0   165.0000000
## 1001811     0.0000000      67.66667     1189.3333             0   212.6666667
## 1000652     0.0000000      88.33333     1894.3333             0   146.6666667
## 1071600     6.2766667     119.66667     1354.6667             0   135.6666667
## 1000318     0.0000000      71.66667     1186.6667             0   126.6666667
## 1085979     0.9633333     219.00000     1464.3333             0   199.6666667
## 1063639     0.0000000     179.00000     1255.3333             0   127.0000000
## 1000572     0.0000000      44.33333     1203.0000             0   170.6666667
## 1070494     1.0766667      89.66667      977.0000             0   168.6666667
## 1098072     2.5933333      78.00000      911.3333             0   146.0000000
## 1038142     8.1866667      84.00000     1261.3333             0     2.6666667
## 1039871     2.3100000     102.33333     1056.0000             0   195.0000000
## 1001077    30.9233333      84.66667      944.6667             0     4.0000000
## 1013042     0.0000000     293.66667      843.0000             0    24.0000000
## 1028321     0.5733333      68.00000     1313.6667             0    24.6666667
## 1069899     0.0000000     154.66667     1429.6667             0    85.3333333
## 1079185     0.9933333     148.66667     1003.6667             0     0.3333333
## 1053185     0.0000000      81.66667     1741.6667             0    61.6666667
## 1057745     0.0000000     119.00000      965.0000             0    45.3333333
##         months uniqsubs actvsubs   adjrev adjqty avgrev avg6mou avg6qty avg6rev
## 1011660     31        2        1 27071.30  68667 902.38    7217    2470     566
## 1049099     13        1        1  6958.35  32050 579.86    6597    3256     779
## 1001811     48        1        1  7172.03  36392 199.22    3457    1639     519
## 1000652     51        2        2 19432.46  83893 396.58    6504    2593     558
## 1071600      8        1        1  2113.88  10996 301.98    3990    1712     333
## 1000318     52        1        1 12625.72  77419 247.56    5589    2176     485
## 1085979      8        1        1  1727.85  19017 246.84    4672    2887     260
## 1063639     12        2        2  5177.66  18781 470.70    5297    1830     429
## 1000572     51        5        5 10061.21  98705 201.22    4674    2673     296
## 1070494     14        1        1  2593.03  19549 199.46    4201    1993     300
## 1098072     48        4        2  8508.59  57416 283.62    2406    1425     333
## 1038142     23        2        1  4811.24  37152 300.70    3369    1840     208
## 1039871     20        1        1  3041.45  41909 160.08    4017    2759     237
## 1001077     48        2        1 11685.87  61539 248.64    5347    1972     413
## 1013042     28        1        1  7436.81  13377 309.87    3428     949     726
## 1028321     23        2        2  2253.18  44228 102.42    3322    2563     129
## 1069899     11        1        1  3273.40  17803 363.71    4272    2071     306
## 1079185      7        3        2  1349.83   4847 224.97    2254     808     225
## 1053185     14        1        1  2792.92  22335 214.84    3525    2140     291
## 1057745     13        1        1  3330.00  12630 277.50    5321    1381     310
##         hnd_price phones models eqpdays
## 1011660  99.98999     19      9      58
## 1049099 129.98999      4      3      62
## 1001811 149.98999      4      4     406
## 1000652 199.98999      8      6     162
## 1071600 129.98999      3      2      35
## 1000318 149.98999      8      4     187
## 1085979 149.98999      2      2      30
## 1063639  29.98999      2      2     122
## 1000572  29.98999      7      5     225
## 1070494 199.98999      7      4      56
## 1098072 199.98999     16      8      -1
## 1038142 199.98999      4      4     119
## 1039871 149.98999      8      4      95
## 1001077 199.98999      8      5     126
## 1013042 149.98999      6      3      84
## 1028321 149.98999      5      4      48
## 1069899  99.98999      7      4      17
## 1079185 129.98999      2      2      49
## 1053185 129.98999      3      2     146
## 1057745 199.98999     20      8      44

Looking at the characteristics of each top customer, we do not really see any common pattern except in the two variables adjqty (adjusted total number of calls) and adjrev (adjusted total revenue): these customers are the ones with the highest values in both.

fviz_pca_var(pca_scaled, col.var = "contrib")

Here we can clearly see that there are strong correlations between some variables, and that we could plausibly sort the clients by these columns.

fviz_pca_biplot(pca_scaled, repel = TRUE)
## Warning: ggrepel: 87554 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Here we confirm that there are differences. The variables pointing to the right contribute most to the first PC, and those pointing upwards contribute most to the second.

Let’s see the customers by the first and the second component.

fviz_pca_ind(pca_scaled,
             label = "none", # hide individual labels
             habillage = customers_original.df$churn, # color by groups
             palette = c("#00AFBB", "#E7B800"))

In these first two principal components it is not possible to clearly separate the churn groups. However, the overlap may simply be because we are projecting points that live in a roughly 100-dimensional space onto 2 dimensions. Next we will use factor analysis and compare the results: let's see whether we can divide the customers, and whether the results are similar to the PCA.

Factor Analysis

# gives error because the data is too large
# x.f = factanal(customers_numeric_no_corr.df, factors = 5, rotation="none", scores="regression")

colsImportant = c()

colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:21]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 2], decreasing = TRUE)[1:8]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:5]))

colsImportant = unique(colsImportant)

As there are too many variables for the factor analysis, we reduce them. We keep the variables with the highest loadings on the first PCs, i.e. the ones that stand out in the graphs above.

colsNotImportant = setdiff(names(customers_numeric_no_corr.df), colsImportant)
    
# Keep only the important columns (drop everything in colsNotImportant).
customers_numeric_no_corr_small.df =
    customers_numeric_no_corr.df[, !(names(customers_numeric_no_corr.df) %in% colsNotImportant)]

After removing the less important variables, we run the factor analysis.

x.f = factanal(scale(customers_numeric_no_corr_small.df), factors = 5, rotation="none", scores="regression")

cbind(x.f$loadings, x.f$uniquenesses)
##                    Factor1      Factor2       Factor3       Factor4
## totmrc_Mean     0.53855899  0.052369938  0.0675643340  0.2445909659
## datovr_Mean     0.05284515  0.492075584  0.0803169986 -0.0108893091
## drop_vce_Mean   0.59235984 -0.032664000  0.3095817836  0.0355732333
## drop_dat_Mean   0.03072469  0.483688287  0.0883024561 -0.0024309374
## blck_dat_Mean   0.02606667  0.209065067  0.0440793503 -0.0005290235
## unan_vce_Mean   0.65630899 -0.055839058  0.4203812132 -0.0008397362
## unan_dat_Mean   0.04088886  0.345591459  0.0551863893 -0.0035888717
## inonemin_Mean   0.58090242 -0.084749786  0.4662057519  0.1332624854
## mou_cvce_Mean   0.93069733 -0.002614453 -0.0141734684 -0.1527130943
## mou_rvce_Mean   0.89421686  0.005914394 -0.0630970632  0.1580234480
## owylis_vce_Mean 0.71277751 -0.032197498  0.3891763949  0.0708396563
## mouowylisv_Mean 0.68895981 -0.001346128  0.1514976843 -0.0064457545
## iwylis_vce_Mean 0.56603953 -0.053063088  0.3048957010  0.0802807222
## mouiwylisv_Mean 0.58762256  0.003304955  0.0002544016  0.0089540951
## peak_dat_Mean   0.07385645  0.833785092  0.1359041825  0.0127266605
## mou_peav_Mean   0.87389769 -0.002093481 -0.0265600962  0.4480372365
## mou_pead_Mean   0.06888616  0.740078717  0.0933773012  0.0104945287
## opk_vce_Mean    0.77213653 -0.071461007  0.5127754955 -0.1556569717
## opk_dat_Mean    0.08801645  0.813932604  0.1542809227 -0.0256235487
## mou_opkv_Mean   0.88465594  0.003012529 -0.0358644036 -0.4553836517
## mou_opkd_Mean   0.06291549  0.595742734  0.0862516667 -0.0234392301
## drop_blk_Mean   0.57111855  0.013738250  0.3031792554  0.0101308494
## complete_Mean   0.85049239  0.005027087  0.4625418001  0.1168874594
## callwait_Mean   0.61248782 -0.063927624  0.2754300741  0.1755496122
## adjqty          0.50878967 -0.032290371  0.2982310055  0.2845531180
## avgrev          0.62604282  0.061862832  0.1179593977  0.2704399729
## avg6mou         0.89154531  0.025405445  0.1442287252  0.0313184693
## avg6qty         0.75471724 -0.025153759  0.4623494786  0.1879025962
## avg6rev         0.65380407  0.069354325  0.0955385001  0.2925922964
##                      Factor5           
## totmrc_Mean     -0.090440234 0.63465199
## datovr_Mean     -0.001762682 0.74860201
## drop_vce_Mean    0.013234005 0.55078548
## drop_dat_Mean   -0.007924692 0.75722698
## blck_dat_Mean    0.002085932 0.95348045
## unan_vce_Mean   -0.013848103 0.38922335
## unan_dat_Mean   -0.010431206 0.87581152
## inonemin_Mean    0.323499082 0.31561092
## mou_cvce_Mean   -0.327158820 0.00500000
## mou_rvce_Mean    0.408690117 0.00500000
## owylis_vce_Mean  0.084330788 0.32726973
## mouowylisv_Mean  0.017974806 0.50199427
## iwylis_vce_Mean  0.305330467 0.48414804
## mouiwylisv_Mean  0.254505903 0.58988678
## peak_dat_Mean   -0.019793636 0.28034492
## mou_peav_Mean   -0.175740166 0.00500000
## mou_pead_Mean   -0.033018032 0.43761949
## opk_vce_Mean     0.259058323 0.04441989
## opk_dat_Mean    -0.001130737 0.30531189
## mou_opkv_Mean    0.071715371 0.00500000
## mou_opkd_Mean   -0.015929955 0.63287576
## drop_blk_Mean   -0.014028861 0.58146595
## complete_Mean   -0.048118017 0.04671952
## callwait_Mean    0.217384431 0.46681364
## adjqty           0.084684189 0.56305404
## avgrev          -0.062035760 0.51331861
## avg6mou          0.004470705 0.18269654
## avg6qty          0.143947208 0.15996598
## avg6rev         -0.059521766 0.46946902

This is useful, but let's plot each factor's loadings and see what we can deduce.

par(mfrow=c(3,1))
barplot(x.f$loadings[,1], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,2], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,3], names=F, las=2, col="darkblue", ylim = c(-1, 1))

barplot(x.f$loadings[,4], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,5], las=2, col="darkblue", ylim = c(-1, 1))

In the graphs, at first sight we can discard factors 4 and 5, as their loadings are not meaningful in any clear way. Factor 3 does not look particularly important either.

However, factors 1 and 2 show an interesting opposition: they load on different sets of variables, suggesting two types of customers with opposite characteristics. One group could correspond to customers with churn = 1 (those who left the company) and the other to those who stayed. Looking at the data and the variables, the first factor could describe the customers who remain in the company, and the second factor those who leave.

In other words, high scores on the first factor would be associated with staying (churn = 0), and high scores on the second factor with leaving (churn = 1).

par(mfrow=c(3,1))
barplot(x.f$loadings[,1], names=F, las=2, col="darkblue", ylim = c(-1, 1))
barplot(x.f$loadings[,2], las=2, col="darkblue", ylim = c(-1, 1))
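One way to test this hypothesis directly (not part of the original analysis, and assuming the rows of x.f$scores are still aligned with customers_original.df) would be to compare the regression factor scores across the churn groups:

```r
# Sketch: compare factor scores between churners and non-churners.
# Assumes row alignment between x.f$scores and customers_original.df.
scores.df = data.frame(x.f$scores, churn = customers_original.df$churn)

par(mfrow = c(1, 2))
boxplot(Factor1 ~ churn, data = scores.df, main = "Factor 1 by churn")
boxplot(Factor2 ~ churn, data = scores.df, main = "Factor 2 by churn")

# Mean score per churn group as a quick numeric summary.
aggregate(cbind(Factor1, Factor2) ~ churn, data = scores.df, FUN = mean)
```

If the hypothesis holds, the two score distributions should shift in opposite directions between the churn groups.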

rm(list = c("colsImportant", "colsNotImportant", "i", "index")) # clean the environment.

Clustering

Let’s see if now with our previous hypothesis we can group the customers in the ones that left and the ones that did not leave.

First of all, let's find out what the optimal number of clusters is. Our data is too large to test on the whole set, so we split it into folds and evaluate on samples, in the spirit of cross-validation.

nFolds = 6
folds = sample(rep(1:nFolds, length.out = nrow(customers_numeric_no_corr_small.df)))
X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 1), ]
#print(fviz_nbclust(X, kmeans, method = 'wss'))

X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 3), ]
#print(fviz_nbclust(X, kmeans, method = 'wss'))

X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 6), ]
#print(fviz_nbclust(X, kmeans, method = 'wss'))

We only print a couple of the graphs, because otherwise the machine cannot handle it. Each portion of the data set looks fairly similar, and the elbow in the within-cluster sum of squares suggests that the optimal number of clusters is around 3, 4 or 5.

Now, let's find the optimal number of clusters using the silhouette method, i.e. the average silhouette width across all points.

for (i in 1:nFolds) {
    X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == i), ]

    print(fviz_nbclust(X, kmeans, method = 'silhouette'))
}

Here, the graphs mostly indicate that 3 clusters is the natural base clustering. Finally, we will use the gap statistic, which relies on bootstrapping, to settle it. However, as bootstrapping is computationally expensive, we will only do it for one fold.

X = as.data.frame(scale(customers_numeric_no_corr_small.df))[which(folds == 1), ]

fviz_nbclust(X, kmeans, method = 'gap_stat', k.max = 8)

After a while, the graph shows that the best number of clusters is 3 or 4.

Putting all of this evidence together, the best number of clusters is 3, so that is what we will use. We will also scale all the data, as explained earlier, so that variables with large scales do not dominate the distances.

Let's start with a plain k-means and see the clusters it produces.

X = as.data.frame(scale(customers_numeric_no_corr_small.df))

fit = kmeans(X, centers=3, nstart=100)
groups = fit$cluster

Now we plot the number of customers in each group.

barplot(table(groups), col="blue")

Here we see one group containing most of the customers, a second that is not as big, and a third that is almost negligible. The hypothesis could be the same as before: one group may correspond to customers who left the company and another to those who stayed. First we will develop this clustering further, and then see whether other clustering methods lead to the same hypothesis.
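Since the churn flag is available (we simply did not use it for clustering), a quick way to probe this hypothesis, again assuming row alignment with customers_original.df, is a cross-tabulation:

```r
# Sketch: cross-tabulate cluster membership against the held-out churn flag.
# Assumes the rows of X are still aligned with customers_original.df.
tab = table(cluster = groups, churn = customers_original.df$churn)
tab

# Churn rate within each cluster (rows sum to 1).
round(prop.table(tab, margin = 1), 3)
```

If one cluster really captures the leavers, its row should show a clearly higher churn proportion.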

centers = fit$centers

for (i in 1:3) {
    bar1 = barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,100),
                   main=paste("Cluster", i, ": Group center in blue, global center in red"))
    points(bar1, y=apply(X, 2, quantile, 0.50), col="red", pch=19)
}


Here we can clearly see that the smallest group behaves like a mixed cluster. The first cluster has much higher values than the second, which could mean its customers are more extreme and therefore more likely to leave the company.

What about plotting the clusters?

fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+scale_fill_brewer(palette="Paired")

Here we can see some minor differences: the most populated cluster (cluster 2) sits furthest to the left, followed by the first cluster and finally the third. This suggests that cluster 3 contains the most extreme customers and cluster 2 the most common ones.

What about two clusters?

fit2 = kmeans(X, centers=2, nstart=100) # use the scaled data, as for the 3-cluster fit
groups2 = fit2$cluster
barplot(table(groups2), col="blue")

centers2 = fit2$centers

for (i in 1:2) {
    bar1 = barplot(centers2[i,], las=2, col="darkblue", ylim=c(-2,100),
                   main=paste("Cluster", i, ": Group center in blue, global center in red"))
    points(bar1, y=apply(X, 2, quantile, 0.50), col="red", pch=19)
}


With two clusters we see the same picture as in the previous test, but only the two most populated clusters survive. That is why we will use 3 clusters: it is more precise.

fviz_cluster(fit2, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+scale_fill_brewer(palette="Paired")

The plot is essentially the same, so we discard this fit.

rm(list = c("fit2", "groups2"))
#fit.kmeans <- eclust(X, "kmeans", stand=TRUE, k=3)
#fviz_silhouette(fit.kmeans)

Because the data set is too large, we cannot compute the silhouette widths. Instead we will use adjustedRandIndex to check whether we get similar results with other clustering methods.

Mahalanobis distance K-Means

S_x = cov(X)
iS = solve(S_x)
e = eigen(iS)
V = e$vectors
B = V %*% diag(sqrt(e$values)) %*% t(V)
Xtil = scale(X,scale = FALSE)
X_maha = as.data.frame(Xtil %*% B)

This whitening transformation makes Euclidean distance on X_maha equivalent to Mahalanobis distance on X.
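As a quick sanity check (not in the original code), the whitened data should have approximately the identity covariance, since B is S^(-1/2):

```r
# After whitening, cov(X_maha) = B' S B = S^(-1/2) S S^(-1/2) = I,
# so off-diagonal entries should be ~0 and diagonal entries ~1.
round(cov(as.matrix(X_maha))[1:4, 1:4], 3)
```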

fit.mahalanobis = kmeans(X_maha, centers=3, nstart=100)

groups = fit.mahalanobis$cluster
centers=fit.mahalanobis$centers

colnames(centers)=colnames(X)
centers
##   totmrc_Mean datovr_Mean drop_vce_Mean drop_dat_Mean blck_dat_Mean
## 1  0.08227674  0.05975652     0.1962781   0.038779652  -0.023503668
## 2 -0.01407061 -0.01199369    -0.0336102  -0.006952899  -0.006260222
## 3  0.02877424  7.00291711     0.2400984   1.276549078  40.441387282
##   unan_vce_Mean unan_dat_Mean inonemin_Mean mou_cvce_Mean mou_rvce_Mean
## 1     0.2021039  0.0001137759    0.17460727     0.4746795     0.6589273
## 2    -0.0344434 -0.0073110620   -0.02992108    -0.0812676    -0.1125608
## 3    -0.3996330 28.6917362272    0.29912291     0.5200136    -0.2656295
##   owylis_vce_Mean mouowylisv_Mean iwylis_vce_Mean mouiwylisv_Mean peak_dat_Mean
## 1       0.4245518      0.25294551      0.17475016      0.28971835   0.018624314
## 2      -0.0724663     -0.04316964     -0.02989372     -0.04951674  -0.001858567
## 3      -0.3973632     -0.25785128      0.09536336     -0.01530147  -5.213056272
##   mou_peav_Mean mou_pead_Mean opk_vce_Mean opk_dat_Mean mou_opkv_Mean
## 1  -0.039239086  -0.011660395   0.90195668  0.030409262     1.1467535
## 2   0.006624845  -0.003516711  -0.15414852 -0.004628858    -0.1960438
## 3   0.323314969  21.680425029  -0.07888063 -2.238560808     0.1294568
##   mou_opkd_Mean drop_blk_Mean complete_Mean callwait_Mean      adjqty
## 1   0.026568443    0.29415167    0.29642035    0.06478493  0.08632330
## 2  -0.006221988   -0.05051968   -0.05068710   -0.01104278 -0.01476367
## 3   6.613490910    0.94941445    0.08232185   -0.12082083  0.03426782
##        avgrev     avg6mou    avg6qty      avg6rev
## 1  0.06840964  0.65719883  0.2310926  0.014523365
## 2 -0.01176314 -0.11231681 -0.0395699 -0.002545864
## 3  0.27582883 -0.06328229  0.2753835  0.249598249

After running the Mahalanobis-distance k-means, let's look at the distribution of the clusters.

barplot(table(groups), col="blue")

It looks like the cluster sizes are the same! Only the cluster labels change, which does not really matter.

Let's see whether the cluster centers are similar to those of the Euclidean-distance k-means.

for (i in 1:3) {
    bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
    points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
}

Here it shows that they are not really the same. It is a bit strange: the smallest cluster is dominated by outlier variables, whereas the center of the most populated cluster sits almost at 0. Let's dig a little deeper and see what is going on.

fviz_cluster(fit.mahalanobis, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal()+scale_fill_brewer(palette="Paired")

Here we can conclude that the most populated clusters remain the same; however, cluster number one is not the extreme one, it mostly captures the outliers.

adjustedRandIndex(fit$cluster, fit.mahalanobis$cluster) 
## [1] 0.3844098

Here we see that the two partitions are not that similar (ARI << 1), but this could be due to the least populated cluster. In the end, I think the Mahalanobis clustering is more accurate, as it splits the data in two and also isolates the outliers. We will try PAM now and see whether this hypothesis is supported.

PAM

Partitioning (clustering) of the data into k clusters around medoids.

# fit.pam = eclust(X, "pam", stand=TRUE, k=3, graph=F)
# ERROR: pam only allows a maximum of 65536 observations

PAM cannot handle this much data, which is why I removed half of the rows. We should keep in mind that PAM's results will be less reliable because we are using a sample, half of what we usually have; the rows were picked at random, so the sample should be roughly representative, but it remains a caveat. We also remove some variables, because otherwise it crashes.

colsImportant = c()

colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:10]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 2], decreasing = TRUE)[1:4]))
colsImportant = c(colsImportant, names(sort(pca_scaled$rotation[, 1], decreasing = TRUE)[1:3]))

colsImportant = unique(colsImportant)

As there are still too many variables, we reduce them further, again keeping the variables with the highest loadings on the first PCs, given by the graphs above.

colsNotImportant = setdiff(names(X), colsImportant)
    
X = scale(customers_numeric_no_corr_small.df)

# scale() returns a matrix, so use colnames() (names() would return NULL here
# and the loop would never drop anything).
for (i in colsNotImportant) {
    index = which(colnames(X) == i)
    
    if (length(index) != 0) {
        X = X[, -index]
    }
}

nFolds = 3
folds = sample(rep(1:nFolds, length.out = nrow(X)))

X = as.data.frame(X[which((folds == 1)), ])
#fit.pam = eclust(X, "pam", k=3, graph=F)
#fviz_cluster(fit.pam, data = X, geom = c("point"), pointsize=1)+ theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = F)+scale_fill_brewer(palette="Paired")
#adjustedRandIndex(fit$cluster, fit.pam$clustering) 

Even after reducing the data set, it does not work. It looks like PAM cannot be performed on a data set this big, so we skip it. We will keep using this much smaller data set for the remaining methods, because otherwise R crashes when we try to run them.
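One workaround worth mentioning, assuming the cluster package is available: clara() was designed for exactly this situation. It runs PAM on several random subsamples and assigns the remaining observations to the nearest medoid. A minimal sketch on the reduced data:

```r
library(cluster)

# clara() fits PAM on `samples` random subsamples of size `sampsize`
# and keeps the best set of medoids; pamLike = TRUE mimics pam()'s swap phase.
fit.clara = clara(X, k = 3, samples = 50, sampsize = 1000, pamLike = TRUE)
table(fit.clara$clustering)
```

The subsample settings here are arbitrary choices; larger values trade speed for stability.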

Kernel K-Means

Let’s start with the Gaussian kernel kmeans.

#fit.ker = kkmeans(as.matrix(X), centers=3, kernel="rbfdot") # Radial Basis kernel (Gaussian)
#object.ker = list(data = X, cluster = fit.ker@.Data)

#fviz_cluster(object.ker, geom = c("point"), ellipse=F,pointsize=1)
#adjustedRandIndex(fit$cluster, fit.ker$cluster) 

As with PAM, it looks like kernel k-means cannot be performed on this data set. We could reduce the data set further, but the result would be of little use: the conclusions would be too far removed from the original data and most probably not correct.

Hierarchical Clustering

What about hierarchical clustering? Is it similar to the plain k-means we performed at the beginning? (All the hierarchical clustering code is commented out because it crashed R whenever I tried to knit the HTML.)

#d = dist(scale(X), method = "euclidean")
#hc = hclust(d, method = "average")
#hc$labels = names

#plot(hc)

Here everything is a mess and we cannot differentiate anything. Let’s try using ggplot and see if we can draw some conclusions.

# fviz_dend(x = hc, k = 3,color_labels_by_k = TRUE,cex = 0.8,type = "phylogenic",repel = TRUE)+theme(axis.text.x=element_blank(),axis.text.y=element_blank())

After waiting more than 4 hours, nothing appeared, so we decided not to draw the dendrogram. It is computationally very expensive for a data set with this many entries. Moreover, the dendrogram would not be that useful, as there should only be two types of customers, all at the same level.
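Although the full dendrogram is out of reach, a sketch on a random subsample (2,000 rows here, an arbitrary choice) still gives an idea of the hierarchical structure:

```r
set.seed(1)
idx = sample(nrow(X), 2000)

# The distance matrix is O(n^2) in memory, so hierarchical clustering
# is only feasible on a subsample of this data.
d.small = dist(X[idx, ], method = "euclidean")
hc.small = hclust(d.small, method = "average")
groups.hc.small = cutree(hc.small, k = 3)
table(groups.hc.small)
```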

#groups.hc = cutree(hc, k = 3)

#fit.small = kmeans(X, centers=3, nstart=100)

We fit a k-means on the small data set.

#adjustedRandIndex(fit.small$cluster, groups.hc)

Even though we used almost the same data set, the two partitions do not look similar at all (the adjusted Rand index is about -0.552e-5, i.e. essentially 0). This could be partly because we are using a small data set, but mainly it is because a dendrogram-based clustering is not the best fit for our data.

EM Clustering

Finally, let's try Expectation-Maximization (EM) clustering. It is related to k-means but assigns each customer a probability of belonging to each cluster. The goal is to maximize the likelihood of the data, where each customer has a certain probability of falling inside each cluster.

#res.Mclust <- Mclust(scale(X))
#summary(res.Mclust)
#head(res.Mclust$classification)
#fviz_mclust(object = res.Mclust, what = "BIC", pallete = "jco") + scale_x_discrete(limits = c(1:10))

EM clustering does not work with this data either, as it is too big; the same thing happens as with PAM. Therefore no cluster plot can be drawn either.
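As with PAM, a sketch on a random subsample would at least show whether the mixture model agrees with the k-means picture (this assumes the mclust package is installed; the subsample size is an arbitrary choice):

```r
library(mclust)

set.seed(1)
idx = sample(nrow(X), 2000)

# Fit Gaussian mixtures with 1 to 5 components and pick the best model by BIC.
res.small = Mclust(scale(X[idx, ]), G = 1:5)
summary(res.small)
table(res.small$classification)
```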

#fviz_mclust(object = res.Mclust, what = "classification", geom = "point",  pallete = "jco")
#adjustedRandIndex(fit$clusters, res.Mclust$classification) 

Clustering Final

In the end, after trying all of these clustering methods, we settled on the plain k-means with Mahalanobis distance, as it sorts the population best. Being a simple model, it can be run on really big data sets such as this one, whereas the more complex alternatives cannot (as explained earlier). We could also have chosen the Euclidean-distance k-means, but the Mahalanobis version sorts the data better, as explained previously.

Conclusion

In the end, we have clearly seen that every technique we applied points to two groups of customers.

PCA showed that some variables are strongly related to the outcome and let us identify them. Moreover, the most important variables of one PC were characteristic of customers who tend to stay with the company, and those of the other PC characteristic of those who leave.

In the factor analysis, only two factors were really important, just as the PCA suggested, and we could interpret them as two latent variables strongly correlated with the churn variable.

Finally, for the clustering part, the only methods that worked were the simple ones. Even so, they sorted the data quite well, especially the Mahalanobis-distance k-means, which split the data into two clear clusters plus one that collected the outliers of the second PC.

To sum up, we identified the most important variables for each group, which could then be used to predict whether a customer will leave the company. For that task, however, supervised learning tools are better suited.